3.5B
3.5B
collection
eye 3.5B
Wide crawls of the Internet conducted by Internet Archive. Please visit the Wayback Machine to explore archived web sites. Since September 10th, 2010, the Internet Archive has been running Worldwide Web Crawls of the global web, capturing web elements, pages, sites and parts of sites. Each Worldwide Web Crawl was initiated from one or more lists of URLs that are known as "Seed Lists". Descriptions of the Seed Lists associated with each crawl may be provided as part of the metadata for...
2B
2.0B
collection
eye 2B
Survey crawls are run about twice a year, on average, and attempt to capture the content of the front page of every web host ever seen by the Internet Archive since 1996.
Topic: survey crawls
1.7B
1.7B
collection
eye 1.7B
Content crawled via the Wayback Machine Live Proxy, mostly by the Save Page Now feature on web.archive.org. The liveweb proxy is a component of Internet Archive’s Wayback Machine project. It captures the content of a web page in real time, archives it into an ARC or WARC file, and returns the ARC/WARC record to the Wayback Machine for processing. The recorded ARC/WARC file becomes part of the Wayback Machine in due course.
647.6M
648M
collection
eye 647.6M
Survey crawl of domains started April 2013. This data is currently not publicly accessible.
468.5M
469M
collection
eye 468.5M
Web wide crawl with initial seedlist and crawler configuration from April 2013.
411.9M
412M
collection
eye 411.9M
Wayback indexes. This data is currently not publicly accessible.
411.3M
411M
collection
eye 411.3M
Survey crawl of domains started December 2014. This data is currently not publicly accessible.
336.6M
337M
collection
eye 336.6M
Web wide crawl with initial seedlist and crawler configuration from June 2014.
310.7M
311M
collection
eye 310.7M
306.8M
307M
collection
eye 306.8M
Web wide crawl with initial seedlist and crawler configuration from January 2015.
295.2M
295M
collection
eye 295.2M
Web wide crawl with initial seedlist and crawler configuration from August 2013.
283.1M
283M
collection
eye 283.1M
Web wide crawl with initial seedlist and crawler configuration from January 2012 using HQ software.
257.4M
257M
collection
eye 257.4M
Survey crawl of domains. This data is currently not publicly accessible.
252.1M
252M
collection
eye 252.1M
Web wide crawl with initial seedlist and crawler configuration from April 2012.
233.9M
234M
collection
eye 233.9M
Survey crawl of domains. This data is currently not publicly accessible.
207.9M
208M
collection
eye 207.9M
Survey crawl of domains started May 2014. This data is currently not publicly accessible.
187.8M
188M
collection
eye 187.8M
184.5M
184M
collection
eye 184.5M
Web wide crawl with initial seedlist and crawler configuration from October 2010.
182.5M
182M
collection
eye 182.5M
Wide crawls of the Internet conducted by Internet Archive. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.
181.6M
182M
collection
eye 181.6M
Web wide crawl with initial seedlist and crawler configuration from September 2012.
161.9M
162M
collection
eye 161.9M
Web wide crawl with initial seedlist and crawler configuration from March 2011 using HQ software.
158.3M
158M
collection
eye 158.3M
Data crawled by Sloan Foundation on behalf of Internet Archive.
149.6M
150M
collection
eye 149.6M
Web wide crawl with initial seedlist and crawler configuration from February 2014.
146.6M
147M
collection
eye 146.6M
Survey crawl of .com domains started January 2011.
Topic: webcrawl
145.1M
145M
collection
eye 145.1M
Crawl of outlinks from wikipedia.org started March 2016. These files are currently not publicly accessible. Properties of this collection: It has been several years since the last time we did this. For this collection, several things were done: 1. Turned off duplicate detection. This collection will be complete, as there is a good chance we will share the data, and sharing data with pointers to random other collections is a complex problem. 2. For the first time, did all the different wikis....
144.7M
145M
collection
eye 144.7M
Web wide crawl with initial seedlist and crawler configuration from March 2011. This uses the new HQ software for distributed crawling by Kenji Nagahashi. What’s in the data set: Crawl start date: 09 March, 2011; Crawl end date: 23 December, 2011; Number of captures: 2,713,676,341; Number of unique URLs: 2,273,840,159; Number of hosts: 29,032,069. The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date. We used Heritrix (3.1.1-SNAPSHOT)...
115.7M
116M
collection
eye 115.7M
Crawl of outlinks from wikipedia.org started February, 2012. These files are currently not publicly accessible.
80.4M
80M
collection
eye 80.4M
A daily crawl of more than 200,000 home pages of news sites, including the pages linked from those home pages. Site list provided by The GDELT Project.
Topics: GDELT, News
77M
77M
collection
eye 77M
60.4M
60M
collection
eye 60.4M
Shallow crawls that collect content 1 level deep including embeds. This data is currently not publicly accessible.
59.2M
59M
collection
eye 59.2M
Crawls of International News Sites
53M
53M
collection
eye 53M
Crawl of outlinks from wikipedia.org started May, 2011. These files are currently not publicly accessible.
37.1M
37M
collection
eye 37.1M
Crawl of outlinks from wikipedia.org started July, 2011. These files are currently not publicly accessible.
33M
33M
collection
eye 33M
Crawl of links from Hacker News.
30.1M
30M
collection
eye 30.1M
This collection includes web crawls of the Federal Executive, Legislative, and Judicial branches of government performed at the end of US presidential terms of office.
Topics: web, end of term, US, federal government
29.4M
29M
collection
eye 29.4M
CDX index shards for the Wayback Machine. The Wayback Machine works by looking for historic URLs based on a query. This is done by searching an index of all the web objects (pages, images, etc.) that have been archived over the years. This collection holds the index used for this purpose, which is broken up into 300 pieces so that they fit into items more naturally and distribute the lookup load. Each of these 300 pieces is stored in at least 2 items, and then those are also stored on the backup...
25.6M
26M
collection
eye 25.6M
COM survey crawl data collected by Internet Archive in 2009-2010. This data is currently not publicly accessible.
24.5M
24M
collection
eye 24.5M
GeoCities crawl performed by Internet Archive. This data is currently not publicly accessible. From Wikipedia: Yahoo! GeoCities is a Web hosting service. GeoCities was originally founded by David Bohnett and John Rezner in late 1994 as Beverly Hills Internet (BHI), and by 1999 GeoCities was the third-most visited Web site on the World Wide Web. In its original form, site users selected a "city" in which to place their Web pages. The "cities" were metonymously named after...
23.3M
23M
collection
eye 23.3M
Shallow crawl started 2013 that collects content 1 level deep, including embeds. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.
22.6M
23M
collection
eye 22.6M
Shallow crawl started 2013 that collects content 1 level deep, including embeds. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.
19.4M
19M
collection
eye 19.4M
This data is currently not publicly accessible.
17.7M
18M
collection
eye 17.7M
Survey crawl of .net domains started December 2010.
Topic: webcrawl
15.9M
16M
Nov 12, 2013
11/13
by
ximm@archive.org
collection
eye 15.9M
Miscellaneous high-value news sites
Topics: World news, US news, news
15.7M
16M
Dec 13, 2012
12/12
by
ximm@archive.org
collection
eye 15.7M
15.1M
15M
collection
eye 15.1M
Survey crawl of domains. This data is currently not publicly accessible.
14.9M
15M
collection
eye 14.9M
This collection contains web crawls performed on the US Federal Executive, Legislative & Judicial branches of government in 2012-2013.
Topics: end of term, US, Federal government, 2012, Obama
14.7M
15M
collection
eye 14.7M
Captures of pages from YouTube. Currently these are discovered by searching for YouTube links on Twitter.
Topics: YouTube, Twitter, Video
12.5M
13M
collection
eye 12.5M
Crawl of International News Sites with initial seedlist and crawler configuration from Sep 1, 2010.
10.8M
11M
collection
eye 10.8M
Survey of .org domains. This data is currently not publicly accessible.
9.5M
9.5M
May 3, 2011
05/11
by
Internet Archive
web
eye 9.5M
Internet Archive Liveweb Capture from WaybackMachine, captured by wwwb-proxy0.us.archive.org:wbm from Sun Mar 27 22:10:09 PDT 2011 to Mon Mar 28 05:27:05 PDT 2011.
Topic: crawldata
9.3M
9.3M
collection
eye 9.3M
Shallow crawl started February 2013 that collects content 1 level deep, including embeds. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.
7.5M
7.5M
collection
eye 7.5M
Survey crawl of .net domains started October 2011.
Topics: webwidecrawl, net
6.9M
6.9M
collection
eye 6.9M
TEST COLLECTION: Crawl of .edu and .gov sites started in June 2010.
Topic: crawldata
6.4M
6.4M
collection
eye 6.4M
End of term 2008 crawl data gathered by Internet Archive on behalf of the California Digital Library. This data is currently not publicly accessible.
5.4M
5.4M
collection
eye 5.4M
Crawl Data. This data is currently not publicly accessible.
5.4M
5.4M
collection
eye 5.4M
Crawl data. This data is currently not publicly accessible.
5.3M
5.3M
collection
eye 5.3M
2004 Election crawl performed by Internet Archive. This data is currently not publicly accessible.
5.2M
5.2M
web
eye 5.2M
5.1M
5.1M
collection
eye 5.1M
Web wide crawl with initial seedlist and crawler configuration from September 2010.
5M
5.0M
collection
eye 5M
Web data related to World Wars I and II collected by Internet Archive in an experimental crawl sponsored by National Endowment for the Humanities and JISC. This data is currently not publicly accessible.
4.5M
4.5M
collection
eye 4.5M
Data crawled from YouTube.com in 2007 by Internet Archive. These files are not currently accessible.
Internet Archive crawldata from Webwide Crawl, captured by crawl423.us.archive.org:wide from Tue Jan 17 08:02:53 PST 2012 to Tue Jan 17 01:16:20 PST 2012.
Topic: crawldata
4M
4.0M
collection
eye 4M
Shallow crawl started November 2012 that collects content 1 level deep, including embeds. This data is currently not publicly accessible.
3.9M
3.9M
collection
eye 3.9M
3.9M
3.9M
collection
eye 3.9M
Data related to Hurricane Katrina collected in 2005 by Internet Archive. This data is currently not publicly accessible. From Wikipedia: Hurricane Katrina was the deadliest and most destructive Atlantic hurricane of the 2005 Atlantic hurricane season. It was the costliest natural disaster, as well as one of the five deadliest hurricanes, in the history of the United States. Among recorded Atlantic hurricanes, it was the sixth strongest overall. At least 1,833 people died in the hurricane and...
3.8M
3.8M
collection
eye 3.8M
End of term 2008 crawl of .gov domains gathered by University of North Texas. This data is currently not publicly accessible. UNT is a student-focused, public research university located in Denton, Texas.
3.7M
3.7M
May 8, 2011
05/11
by
Internet Archive
web
eye 3.7M
Internet Archive Liveweb Capture from WaybackMachine, captured by wwwb-proxy0.us.archive.org:wbm from Sun May 8 07:07:52 PDT 2011 to Sun May 8 08:00:29 PDT 2011.
Topic: crawldata
3.6M
3.6M
web
eye 3.6M
Data crawled by Internet Archive on behalf of Internet Archive from Fri Nov 01 06:23:33 PDT 2002 to Tue Nov 19 23:24:07 PDT 2002
Topic: crawldata
3.5M
3.5M
web
eye 3.5M
Data crawled by National Endowment for the Humanities and JISC on behalf of Internet Archive from Fri Aug 08 00:17:40 PDT 2008 to Thu Jun 26 05:29:33 PDT 2008
Topic: crawldata
3.4M
3.4M
May 3, 2011
05/11
by
Internet Archive
web
eye 3.4M
Internet Archive Liveweb Capture from WaybackMachine, captured by wwwb-proxy0.us.archive.org:wbm from Mon Mar 28 12:43:47 PDT 2011 to Mon Mar 28 16:56:17 PDT 2011.
Topic: crawldata
3M
3.0M
May 20, 2011
05/11
by
Internet Archive
web
eye 3M
Internet Archive Liveweb Capture from WaybackMachine, captured by wwwb-proxy0.us.archive.org:wbm from Fri May 20 00:54:34 PDT 2011 to Fri May 20 04:55:47 PDT 2011.
Topic: crawldata
2.9M
2.9M
May 20, 2011
05/11
by
Internet Archive
web
eye 2.9M
Internet Archive Liveweb Capture from WaybackMachine, captured by wwwb-proxy0.us.archive.org:wbm from Thu May 19 17:19:06 PDT 2011 to Thu May 19 17:46:28 PDT 2011.
Topic: crawldata
2.8M
2.8M
May 4, 2011
05/11
by
Internet Archive
web
eye 2.8M
Internet Archive Liveweb Capture from WaybackMachine, captured by wwwb-proxy0.us.archive.org:wbm from Tue Mar 29 00:12:24 PDT 2011 to Tue Mar 29 07:24:41 PDT 2011.
Topic: crawldata
2.7M
2.7M
Apr 4, 2013
04/13
by
Internet Archive
web
eye 2.7M
Internet Archive Liveweb Capture from WayBack Machine, captured by wwwb-live1.us.archive.org from 2013-03-29T09:54:55 UTC to 2013-03-29T13:00:26 UTC.
Topic: crawldata
2.6M
2.6M
collection
eye 2.6M
This collection contains web crawls performed as part of the End of Term Web Archive, a collaborative project that aims to preserve the U.S. federal government web presence at each change of administration. Content includes publicly-accessible government websites hosted on .gov, .mil, and relevant non-.gov domains, as well as government social media materials. The web archiving was performed in the Fall and Winter of 2016 and Spring of 2017. For more information, see...
Topics: end of term, federal government, 2016, president, congress, government data